\[ \definecolor{mathBlack}{RGB}{0,0,0} \definecolor{mathOrange}{RGB}{253, 126, 20} \definecolor{mathLightGreen}{RGB}{32, 201, 151} \definecolor{mathGreen}{RGB}{24, 188, 156} \definecolor{mathYellow}{RGB}{253, 156, 18} \definecolor{mathBlue}{RGB}{52, 152, 219} \definecolor{mathRed}{RGB}{231, 76, 60} \definecolor{mathPurple}{RGB}{111, 66, 193} \]
Intro to Data Analysis
Agenda
- Introduction to Data Analysis
- Frequency Distributions
Readings
- What’s the point?
- Being a Statistician Means Never Having to Say You’re Certain
Introduction to Data Analysis
- UNIT OF ANALYSIS
- POPULATION
- SAMPLE
- N & n
- DESCRIPTIVE STATISTICS
- INFERENTIAL STATISTICS
- TIDY DATA
- VARIABLES
- DICHOTOMOUS
- NOMINAL
- ORDINAL
- INTERVAL-RATIO
Unit of Analysis
Who or what is being studied?
Notation
N refers to population size
n refers to sample size
Tidy Data
Measurement Levels
Dichotomous (aka binary)
A variable with only two categories.
Nominal
A variable made up of categories that cannot be ordered
Ordinal
A variable made up of ranked categories, with no systematic or measurable numeric difference between the categories.
Continuous (aka interval-ratio)
A variable whose values are ordered and expressed in the same units, so the distance between any two values is meaningful.
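One way to make these four levels concrete is to look at how each might be stored in R. The sketch below uses small made-up vectors; the variable names and values are illustrative assumptions, not GSS data.

```r
# Hypothetical toy data; all values are made up for illustration only.

# Dichotomous: exactly two categories
employed <- factor(c("yes", "no", "yes", "yes"))

# Nominal: categories with no inherent order
religion <- factor(c("Christianity", "Islam", "Judaism", "Islam"))

# Ordinal: ranked categories, stored as an ordered factor
crime_level <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

# Interval-ratio: numeric values expressed in the same units (dollars)
income <- c(42000, 58500, 61000, 39000)

nlevels(employed)                # number of categories
is.ordered(religion)             # order is not meaningful for nominal data
is.ordered(crime_level)          # order is meaningful for ordinal data
crime_level[1] < crime_level[2]  # comparisons work on ordered factors
mean(income)                     # arithmetic is meaningful for interval-ratio data
```

Storing an ordinal variable as an ordered factor keeps the ranking without implying that the gaps between categories are equal.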
Which of the following is an example of descriptive statistics?
- The average age of U of T students is 21 years.
- Students who study more than 10 hours a week are more likely to achieve higher grades.
- About 65% of U of T undergrads feel socially connected, based on a sample of 300 students.
- A hypothesis test shows a significant relationship between income and education level.
Which of the following is an example of a nominal level of measurement?
- The type of religion a person identifies with (e.g., Christianity, Islam, Judaism)
- The number of children in a household
- The income of a family in dollars
- A ranking of provinces by marriage rate.
Which level of measurement is used when ranking neighborhoods by crime rate (e.g., low, medium, high)?
- Ordinal
- Nominal
- Dichotomous
- Continuous
Frequency Distributions
- FREQUENCY DISTRIBUTION
- RELATIVE FREQUENCY DISTRIBUTION
- PROPORTION
- PERCENTAGE
- CUMULATIVE
- RATE
- BAR GRAPH
- HISTOGRAM
- LINE GRAPH
- STATISTICAL MAP
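# Packages used in this section
library(dplyr)      # select(), filter(), count(), summarise(), across()
library(tidyr)      # pivot_wider()
library(haven)      # as_factor(), zap_missing()
library(flextable)  # flextable(), color(), bg()

# Convert labelled survey variables to factors; special missing codes become NA
gss_all$premarsx <- as_factor(zap_missing(gss_all$premarsx))
gss_all$sex <- as_factor(zap_missing(gss_all$sex))

# Frequency distribution for 2024 respondents
freq_premarsx <- gss_all %>%
  select(id, year, sex, premarsx) %>%
  filter(year == 2024, !is.na(premarsx)) %>%
  count(premarsx)

# Total row
total_row <- freq_premarsx %>%
  summarise(across(where(is.numeric), sum)) %>%
  mutate(premarsx = "Total")

# Combine
table_premarsx <- rbind(freq_premarsx, total_row)

Table 1. Attitudes about sex before marriage

# Render the table (style_flextable() is not part of flextable; it is assumed to be a helper defined elsewhere in the course materials)
table_premarsx %>%
  flextable() %>%
  style_flextable()

| premarsx | n |
|---|---|
| always wrong | 357 |
| almost always wrong | 122 |
| wrong only sometimes | 258 |
| not wrong at all | 1,378 |
| Total | 2,115 |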
Survey question: There’s been a lot of discussion about the way morals and attitudes about sex are changing in this country. If a man and woman have sex relations before marriage, do you think it is _________.
Table 1. Attitudes about sex before marriage

# Highlight the Total row (row 5)
table_premarsx %>%
  flextable() %>%
  style_flextable() %>%
  color(color = "#E74C3C", i = 5, j = "n")

| premarsx | n |
|---|---|
| always wrong | 357 |
| almost always wrong | 122 |
| wrong only sometimes | 258 |
| not wrong at all | 1,378 |
| Total | 2,115 |

The highlighted total (2,115) is the number of respondents who answered this survey question.
Table 1. Attitudes about sex before marriage

# Highlight the "wrong only sometimes" row (row 3)
table_premarsx %>%
  flextable() %>%
  style_flextable() %>%
  color(color = "#E74C3C", i = 3, j = "n")

| premarsx | n |
|---|---|
| always wrong | 357 |
| almost always wrong | 122 |
| wrong only sometimes | 258 |
| not wrong at all | 1,378 |
| Total | 2,115 |

The highlighted value (258) is the number of respondents who said pre-marital sex was “wrong only sometimes.”
Source: U.S. General Social Survey 2024
Are women more likely than men to say premarital sex is “not wrong at all”?
Table 2. Attitudes about sex before marriage by gender
gss_all$premarsx <- as_factor(zap_missing(gss_all$premarsx))
gss_all$sex <- as_factor(zap_missing(gss_all$sex))

# Frequency distribution by sex, one column per sex
freq_premarsx <- gss_all %>%
  select(id, year, sex, premarsx) %>%
  filter(!is.na(premarsx), !is.na(sex)) %>%
  group_by(sex) %>%
  count(premarsx) %>%
  pivot_wider(
    names_from = sex,
    values_from = n,
    values_fill = 0
  )

# Create total row
total_row <- freq_premarsx %>%
  summarise(across(where(is.numeric), sum)) %>%
  mutate(premarsx = "Total") %>%
  select(names(freq_premarsx)) # ensure column order matches

# Combine
table_premarsx <- rbind(freq_premarsx, total_row)

# Get number of rows
n_rows <- nrow(table_premarsx) # fixing transparency

# Render the table
table_premarsx %>%
  flextable() %>%
  style_flextable() %>%
  # Manually add zebra stripes with solid colors w/o transparency
  bg(i = seq(1, n_rows, by = 2), bg = "white", part = "body") %>%
  bg(i = seq(2, n_rows, by = 2), bg = "#F2F2F2", part = "body")

| premarsx | male | female |
|---|---|---|
| always wrong | 4,159 | 7,116 |
| almost always wrong | 1,499 | 2,388 |
| wrong only sometimes | 3,904 | 4,792 |
| not wrong at all | 10,672 | 11,086 |
| Total | 20,234 | 25,382 |
Source: U.S. General Social Survey 1972-2024
Proportions are between 0 and 1.0.
Proportion = count (f) / total number of cases (N).
Percentages are between 0 and 100.
Percentage = proportion × 100.
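As a worked example of these two formulas, the sketch below recomputes proportions and percentages from the Table 1 counts. The object names are illustrative choices; the counts are copied from Table 1.

```r
# Counts (f) copied from Table 1
counts <- c(
  "always wrong"         = 357,
  "almost always wrong"  = 122,
  "wrong only sometimes" = 258,
  "not wrong at all"     = 1378
)

N <- sum(counts)                # total number of cases
proportion <- counts / N        # each value lies between 0 and 1
percentage <- proportion * 100  # each value lies between 0 and 100

round(proportion, 3)
round(percentage, 1)
sum(proportion)                 # proportions across all categories sum to 1
```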
gss_all$premarsx <- droplevels(gss_all$premarsx)

# Create frequency & percentages table
tab <- gss_all %>%
  filter(!is.na(premarsx), !is.na(sex)) %>%
  group_by(sex, premarsx) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(sex) %>%
  mutate(percent = round(100 * n / sum(n), 0)) %>%
  ungroup() %>%
  pivot_wider(
    names_from = sex, values_from = c(n, percent)
  )

# Add totals row
# (dplyr >= 1.1.0: pass extra arguments through an anonymous function, not via `...`)
tab_totals <- tab %>%
  summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE))) %>%
  mutate(premarsx = "Total")

# Combine with original table
tab_with_totals <- bind_rows(tab, tab_totals)

Table 2. Attitudes about sex before marriage by gender
## Pretty table
tab_with_totals %>%
  select(premarsx, n_male, percent_male, n_female, percent_female) %>%
  flextable() %>%
  style_flextable() %>%
  set_header_labels(
    n_male = "n", percent_male = "%",
    n_female = "n", percent_female = "%"
  ) %>%
  add_header_row(
    values = c("", "Men", "Women"),
    colwidths = c(1, 2, 2)
  ) %>%
  align(j = c(2, 3, 4, 5), align = "center", part = "all") %>%
  color(color = "#18bc9c", i = 4, j = 4) %>%
  color(color = "#fd7e14", i = 5, j = 4) %>%
  color(color = "#e74c3c", i = 4, j = 5)

| premarsx | Men (n) | Men (%) | Women (n) | Women (%) |
|---|---|---|---|---|
| always wrong | 4,159 | 21 | 7,116 | 28 |
| almost always wrong | 1,499 | 7 | 2,388 | 9 |
| wrong only sometimes | 3,904 | 19 | 4,792 | 19 |
| not wrong at all | 10,672 | 53 | 11,086 | 44 |
| Total | 20,234 | 100 | 25,382 | 100 |
\(\frac{\color{mathGreen}{11{,}086}}{\color{mathOrange}{25{,}382}} \times 100 = 0.4368 \times 100 = \color{mathRed}{43.7\%}\)
TIP: The percentages in a column should sum to 100 (allowing for small differences due to rounding)!
Table 2. Attitudes about sex before marriage by gender

## Pretty table
tab_with_totals %>%
  select(premarsx, n_male, percent_male, n_female, percent_female) %>%
  flextable() %>%
  style_flextable() %>%
  set_header_labels(
    n_male = "n", percent_male = "%",
    n_female = "n", percent_female = "%"
  ) %>%
  add_header_row(
    values = c("", "Men", "Women"),
    colwidths = c(1, 2, 2)
  ) %>%
  align(j = c(2, 3, 4, 5), align = "center", part = "all") %>%
  color(color = "#E74C3C", i = 4, j = 3) %>%
  color(color = "#E74C3C", i = 4, j = 5)

| premarsx | Men (n) | Men (%) | Women (n) | Women (%) |
|---|---|---|---|---|
| always wrong | 4,159 | 21 | 7,116 | 28 |
| almost always wrong | 1,499 | 7 | 2,388 | 9 |
| wrong only sometimes | 3,904 | 19 | 4,792 | 19 |
| not wrong at all | 10,672 | 53 | 11,086 | 44 |
| Total | 20,234 | 100 | 25,382 | 100 |
A greater proportion of men (53%) than women (44%) say premarital sex is “not wrong at all.”
Source: U.S. General Social Survey 1972-2024
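The comparison in Table 2 can be reproduced from the counts alone. The sketch below is a minimal base R version that assumes the counts are typed in by hand rather than computed from `gss_all`; the object names are illustrative.

```r
# Counts copied from Table 2 (GSS 1972-2024); matrix() fills column by column
counts <- matrix(
  c(4159, 1499, 3904, 10672,    # men
    7116, 2388, 4792, 11086),   # women
  ncol = 2,
  dimnames = list(
    premarsx = c("always wrong", "almost always wrong",
                 "wrong only sometimes", "not wrong at all"),
    sex = c("male", "female")
  )
)

# margin = 2 divides each cell by its column total (column percentages)
col_pct <- round(100 * prop.table(counts, margin = 2), 1)
col_pct

colSums(counts)  # column totals used as the denominators
```

Percentages are compared rather than raw counts because the two groups differ in size: more women than men chose “not wrong at all” in absolute terms (11,086 vs. 10,672), yet the share is lower among women.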
Table 3. Attitudes about sex before marriage, with cumulative percentages
gss_all$premarsx <- droplevels(gss_all$premarsx)

# Create frequency, percentage & cumulative percentage table
tab <- gss_all %>%
  filter(year == 2024, !is.na(premarsx)) %>%
  group_by(premarsx) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(
    percent = round(100 * n / sum(n), 0),
    cum_percent = round(cumsum(percent), 0)
  ) %>%
  ungroup()

# Add totals row for n and % only; summing cumulative percentages is not meaningful
tab_totals <- tab %>%
  summarise(across(c(n, percent), sum)) %>%
  mutate(premarsx = "Total")

# Combine with original table (cum_percent is NA in the Total row)
tab_with_totals <- bind_rows(tab, tab_totals)

## Pretty table
tab_with_totals %>%
  flextable() %>%
  style_flextable() %>%
  set_header_labels(
    n = "n", percent = "%",
    cum_percent = "cumulative %"
  ) %>%
  color(color = "#18bc9c", i = 1, j = 3) %>%
  color(color = "#fd7e14", i = 2, j = 3) %>%
  color(color = "#e74c3c", i = 2, j = 4)

| premarsx | n | % | cumulative % |
|---|---|---|---|
| always wrong | 357 | 17 | 17 |
| almost always wrong | 122 | 6 | 23 |
| wrong only sometimes | 258 | 12 | 35 |
| not wrong at all | 1,378 | 65 | 100 |
| Total | 2,115 | 100 | |
\({\color{mathGreen} 17\%} + {\color{mathOrange} 6\%} = {\color{mathRed} 23\%}\)
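The cumulative column in Table 3 is a running total of the percentage column. A minimal base R sketch, using the 2024 counts copied from the table (the object names are illustrative):

```r
# 2024 counts copied from Table 3, in the order of the response scale
counts <- c(
  "always wrong"         = 357,
  "almost always wrong"  = 122,
  "wrong only sometimes" = 258,
  "not wrong at all"     = 1378
)

percent <- 100 * counts / sum(counts)  # unrounded percentages
cum_percent <- cumsum(percent)         # running total, category by category

round(percent)
round(cum_percent)
```

This version accumulates the unrounded percentages and rounds only at the end, so rounding error does not build up. The last cumulative value is always 100, and a Total row has no cumulative percentage of its own.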
Examples:
- Canada’s divorce rate decreased from 12.7 per 1,000 in 1991 to 5.6 per 1,000 in 2020.
- The 2021 suicide rate of 14.8 per 100,000 population for middle-aged Canadians (30-59 years old) was the highest of any age group.
- Canada’s total fertility rate reached a new low of 1.26 children per woman in 2023.
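Rates like those in the first two examples divide the number of events by the size of the population in which they could occur, then scale by a base such as 1,000 or 100,000. The sketch below shows that arithmetic; `rate_per()` is a hypothetical helper written for this example, and the input numbers are made up, not real statistics.

```r
# rate_per() is a hypothetical helper, not a function from any package
rate_per <- function(events, population, per = 1000) {
  events / population * per
}

# Made-up numbers for illustration only
rate_per(events = 150, population = 12000)               # events per 1,000
rate_per(events = 9, population = 60000, per = 100000)   # events per 100,000
```

Choosing a larger base (per 100,000 instead of per 1,000) keeps rare events readable as whole numbers rather than small decimals.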
Nominal variables:
can have frequency distributions, but not cumulative frequency distributions, because the categories have no order
Ordinal variables:
can have frequency distributions and cumulative frequency distributions
Interval-ratio variables:
can have frequency distributions, cumulative frequency distributions, and rates
A bar graph is used:
for nominal or ordinal variables,
to show frequencies or percentages,
using separated rectangles, with height proportional
to the frequency or percentage.
A histogram is used:
for interval-ratio variables,
to show frequencies or percentages,
using contiguous rectangles, with height proportional
to the frequency or percentage in each interval.
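In base R, these two graph types correspond to `barplot()` and `hist()`. The sketch below uses small made-up vectors (not GSS data) to show the difference: a bar graph draws one bar per category, while a histogram first groups numeric values into intervals.

```r
# Made-up data for illustration only
attitude <- factor(
  c("always wrong", "not wrong at all", "not wrong at all",
    "wrong only sometimes", "not wrong at all", "always wrong"),
  levels = c("always wrong", "wrong only sometimes", "not wrong at all")
)
age <- c(19, 23, 31, 35, 38, 42, 47, 51, 58, 64)

# Bar graph: one bar per category, drawn with gaps between bars
freq <- table(attitude)
barplot(freq, ylab = "Frequency", main = "Bar graph (ordinal variable)")

# Histogram: values are grouped into 10-year intervals and the bars touch
h <- hist(age, breaks = seq(10, 70, by = 10),
          xlab = "Age", main = "Histogram (interval-ratio variable)")
h$counts  # frequency in each interval
```

The gaps in a bar graph signal distinct categories, while touching bars in a histogram signal a continuous scale cut into intervals.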
A line graph is used:
for interval-ratio variables,
to show frequencies or percentages,
joining the frequency or average for each category with a line.
A statistical map is used:
for interval-ratio variables,
to show geographical variations, often in ratios,
using variation in color or hue.
Lab 01
- CODEBOOK
will add later